October 2003 



UTAS-PHYS-03-07 



q-bio/ 



U(l)xU(l) xU(l) symmetry of the Kimura 3ST model 
and phylogenetic branching processes 



J D Bashford*, P D Jarvis* and J G Sumner, 

School of Mathematics and Physics, University of Tasmania 
GPO Box 252-21, Hobart Tas 7001, Australia 

and M A Steel*, 

Biomathematics Research Centre, University of Canterbury, 
Christchurch, New Zealand 



An analysis of the Kimura 3ST model of DNA sequence evolution is given on the basis 
of its continuous Lie symmetries. The rate matrix commutes with a U(l) x U(l) x U(l) 
phase subgroup of the group GL(A) of 4 x 4 invertible complex matrices acting on a 
linear space spanned by the 4 nucleic acid base letters. The diagonal 'branching operator' 
representing speciation is defined, and shown to intertwine the U(l) x U(l) x U(l) action. 
Using the intertwining property, a general formula for the probability density on the leaves 
of a binary tree under the Kimura model is derived, which is shown to be equivalent to 
established phylogenetic spectral transform methods. 
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The use of Markov models of stochastic change to taxonomic character distributions 
is part of the standard armoury of techniques for describing mutations and inferring 
ancestral relationships between taxa. For the simplest models, symmetries of the rate 
matrix under discrete group actions (Z2 for binary types, or Z2 x Z2 for DNA or RNA 
bases in molecular applications, for example) have been used to good effect in simplifying 
phylogenetic analysis. In particular, much attention has been centred on properties of the 
frequently used Kimura 3ST model [1, which possesses such symmetry. 

In this letter we describe an approach to the analysis of symmetries using continous 
transformation groups. Rather than identify the character types with elements of a (non- 
abelian or abelian) discrete 'colour' group which patterns the rate matrix for transitions 
between types into orbit classes in the traditional way, we look at linear transformations 
on the 'character space' spanned by the character types, and consider (complex, invert- 
ible) matrices which commute with the rate matrix. As we shall show, this approach, 
when implemented in the Kimura model, leads to an analysis which is well adapted to 
the established Hadamard discrete Fourier transform formalism [3J HJ EJ, but which 
importantly has potential generalisations going beyond the colour groups. 

Generically, let {p a (t),a = 1,...,K} be the probabilities that the system has trait 
a = 1, 2, . . . , K respectively. Introducing unit vectors e a , a — 1, . . . , K the state vector 
representing the system 

p(t) = Pi{t)e\ +P2{t)e 2 + ... +PK(t)e K (1) 
is subject to linear time evolution, 

j t p(t) = R-p(t), (2) 

where the operator R is a suitable K x K Markov rate matrix. It is natural to decompose 
R as 

R = \(-l + f) (3) 

where the traceless part T belongs by definition to the Lie algebra sl(K) (see below for 
the K = 4 case). The usual (positive) rates for substitution between different characters 
are thus the off-diagonal elements of T. A formal solution to (J2J) for time-independent 
rates is 

p(t) = e- xt -e xtf -p(0). (4) 

The vector p(t) in C K is a probability density if each p a > and J2 a P a = 1- Consistent 
with the time dependence imposed by the master equation, given a starting density, 
probability conservation is implemented by demanding that R is a unit column sum 
matrix. Introducing the vector Q representing the sum of all the unit vectors in the 
distinguished basis 

Q = e 1 + e 2 + . . . e K , 

probability conservation requires that the dual Q (the row vector with unit entries) is 
annihilated by R regarded as an operator on the dual space, Q 1 - ■ R = 0. Equivalently, 
Q 1 - is a left unit eigenvector of T. 



In the Kimura model the characters a are of course the standard nucleic acid base 
letters A, G, U and C, and the rate matrix is 1 
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wherein the total change rate parameter in ® above is A = a + p + 7, and the traceless 
part of the rate operator can be written in the form 
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provide a maximal set of commuting generators for the Lie algebra s/(4) and thus can be 
chosen as the basis for a Cartan subalgebra; equivalently there exists a transformation of 
the basis spanned by e^, &G-, £u and ec onto a new basis, in which the Kimura generators 
are diagonal (with doubly degenerate eigenvalues ±1 by the traceless property, and the 
fact that they are square roots of 1). This transformation is well known to be generated 
by the Hadamard matrix H, which sends Ki as matrices to HA'jH -1 , i G {a, 0,7}: 
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Note finally that H can be decomposed as a tensor product of two-dimensional forms 

1 1 



H = h ® h, 



Thus far, we have recovered the standard analysis, with the emergence of the Hadamard 
transformation as the key to resolving the Kimura model. For (multi)-taxon probability 
densities evolving independently, the time evolution after time t is by extension of (jlj) 
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lr rhe rate parameters a,/3, 7 describe base transitions, and two classes of transversions, respectively. 



where P is a tensor of rank > 2 carrying the probability density on a sample space spanned 
by the appropriate Cartesian product of character sets, and T := T<8>]1(8>. . . 1+1®T ... + ... 
the corresponding off-diagonal rate operator lifted to the tensor product space. Clearly, 
the higher rank Hadamard operator HgHg>. . . again implements the correct diagonalisation 
in this case. The work of [31111310] using discrete Fourier analysis on trees establishes that, 
remarkably, the Hadamard transform technique still applies, even when the multi-taxon 
system has evolved via a phylogenetic tree. 

In order to pursue the alternative analysis via Lie symmetries, we exploit the ob- 
servation that the Kimura operators K{, i e {a,/3,~f} provide a Cartan subalgebra for 
transformations belonging to the Lie algebra s/(4) of the group SL{4) of (complex) ma- 
trices 2 . Clearly the Hadamard basis vectors h a := H • e a are simultaneous eigenvectors 
of the Kimura generators. In a general representation of SL(4), the eigenvalues of the 
Kimura generators simply correspond to the weight decomposition with respect to the 
Cartan subalgebra. Aside from the overall scaling by e~ xt , the Markov transition opera- 
tor e~ xt ■ e xtT appends an exponential time dependence given by the sum of these weights, 
multiplied by the Kimura 'charge' parameters a, /3, 7. Thus, in principle, provided the 
Markov model respects the symmetry, its spectral properties, and hence the time develop- 
ment of a multi-taxon density, can be deduced from an appropriate weight decomposition 
of the corresponding tensor representation of SL(4). 

To confirm that the analysis does indeed carry through in the presence of phylogenetic 
trees, we now turn to the description of the branching process itself. The usual formalism 
of stochastic models of base substitution [2] can conveniently be encapsulated via a linear 
operator 5, which changes the state vector representing a single taxon, to that representing 
independent progeny after branching 3 . In the nucleic acid basis we have 

5 ■ e A = e A <S> e A , S ■ e G = e G <g> e G , 

5 ■ e v = e v <g> e v , S ■ e c = e c <E> e c , (10) 

so that, when applied to a vector p representing the density on bases for one taxon, we 
have 

V = PAe A + PgGg + Pueu + Vc^c 
-> 5 ■ p = p A e A ® e A + p G e G <g> e G + p v eu <g> e v + p G e G ®e G , (11) 

after which evolution proceeds for the model on two taxa (with all operations lifted to 
the tensor product space carrying the Cartesian square of the character set, as described 
by © above). 

A stochastic model may be said to possess a symmetry under a continuous transfor- 
mation group G if the rate matrix commutes with its generators, and hence intertwines 
the group action. Formally if the action is p(t) — > p'(t) = g-p(t) then the master equation 
(J2J) retains its form, for all g G G and for arbitrary p(t), as 



2 SL(4) ~ GL(4)/C X where invertible matrices are factored by the multiplicative group of complex 
numbers corresponding to their (nonzero) determinants. 

3 A dynamical, many-body formulation of phylogenetic branching processes has been presented in 0. 



iff gR = Rg, or [R, K] = with K a generator of the group G (g ~ e ). Similarly 
a branching operator 5 admits a symmetry under such transformations if it intertwines 
the action of G on the character space of a single taxon, with some action on the tensor 
product space: 

5og = go5. (12) 

Such symmetry considerations lead to useful ways of analysing the tree structure 
of general phylogenetic branching processes, which we hope to take up in a separate 
work. Here we examine the implications for the Kimura model as a first example. It is 
clear from the above remarks that the rate matrix admits a GL(1) x GL(1) x GL(1) ~ 
C x xC x xC x group of symmetry tansformations. For present purposes it is sufficient 
to take the corresponding unitary phase subgroup, £7(1) x £7(1) x £7(1). Turning to 
the diagonal branching operator f(10j) . it is obvious that the Kimura generators in the 
distinguished basis simply act as permutations of the basic unit vectors e A , e G , eu, &c and 
hence themselves have a diagonal intertwining property 4 : 

5oKi = Ki®K t o5, ie {a,P, 7}. (13) 

Thus we conclude that the Kimura model has £7(1) x £7(1) x £7(1) symmetry, both in 
the sense of commuting with the rate matrix, and in the intertwining property for the 
branching operator. 

With the above preliminaries we sketch briefly the the way in which the above algebraic 
structure can be applied to an analysis of the Kimura model for phylogenetic trees, which 
is consistent with the Fourier transform methods. Fixing a rooted tree on L leaves, the 
full time evolution from the initial root density to the leaf density can be represented 
abstractly as a product of strings of terms of the form 

. . . (M[ <g> M' 2 <g> . . . ® M; +1 ) • (1 ® 1 ® . . . 8 ® . . . <g> 1) • (Mi <g> M 2 <g> . . . <g> M r ) . . . (14) 

where it is implied that, for the time slices of the tree under consideration, with r taxa 
evolving, a branching event 5 took place on a particular edge leading to r + 1 taxa evolving, 
the M being simply the appropriate Markov transition matrices e AtR . The intertwining 
property (fE?j) can now be used to pull all the 5 operators back to the root node, so that 
the final expression for the leaf density is of the form of products of exponentials of tensor 
products of Kimura operators, acting on the fully branched state 6 

5 {L - 1] p(0) = p A (0)e A ®e A ...®e A + p G (0)e G ® e G . . . <8> e G + 

Pu(0)eu (g) eu ■ ■ ■ <8> &u + Pc(0)ec ® e G . . .® e G . (15) 

Working in the Hadamard basis allows the exponentials to be diagonalised in terms of 
the weights of the tensor product states under the induced £7(1) x £7(1) x £7(1) action. 

4 In the case of abelian algebras, a 'group-like' coproduct K — » K®K gives a coassociative coalgebra 
structure, and the tensor product spaces carry a consistent comodule action. 
5 See also 0. 

6 The operator 5 is coassociative, (1<S) 5)oS = (5 ® t)oS = S^K 



The combinatorics of the tree is of course encoded, in that the change on each edge 
explicit in (|14j) is inherited by the differing total weights of each factor, and hence different 
exponential time dependence, in the L edges emanating from (j!5|) above. 

As an example we specialise to the binary character case (the symmetric two colour 
model jHlin])- Suppose the character set is {Y, R} for definiteness. The analogue of the 
Kimura operator is k, and there is only one rate parameter a with R = a(— t + k). The 
analogue of (0) is 
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Consider the descending rooted 4-leaf tree (1(2(34))). Labelling the non-leaf edges 5,6 
in order of ascending level away from the leaves, define the total edge change parameters 
(including time intervals) as 
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(effectively allowing the a parameter in the rate matrix to be edge-dependent), and also 
the leaf operators 
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Applying (fTB^) . (JHJ), we have for the leaf density 



where also ks 



P lea j = e ~T,e a e . e «lki+a 2 k2+Q3k3+a4k4+0 5 k5+a6k6 . £ 3 p(0) 

■ 1® 1® k® k, k 6 = 1 ® k <g> k ® k. (17) 



The composite operator in (fT7j) acts in the Hadamard basis to give a signed sum of edge 
parameters, with the signs determined by products of k-weights, eigenvalues of the various 
leaf operators acting on <5^»(0) = py(0)ey ®ey ®ey ®ey+p#(0)e#®e#®eR®e# expanded 
via the inverse Hadamard transform (see ()16j0 . 
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Multiplying through by the overall prefactor, the positively signed edge parameters cancel 
in the exponent. For example the coefficient of h + ® /i_ ® h + ® /i_ in the expansion of 
(fT7|) becomes 
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As explained above, the use of (|13jh ()14|) in generalising (|T7|) to an arbitrary tree 
T amounts to considering how the symmetry group on the linear space spanned by the 
evolving probability density of a single system, extends after branching to transformations 
acting on the L-fold tensor product (in the Kimura 3ST model, the symmetry group is 



£7(1) x[/(l)x £7(1), and in the symmetric two-colour model just £7(1)). Taking the binary 
case for simplicity, the general form of (fT7j) reads (cf Q and (J3J)) 

Pieaf=e-^^-e^-5^p(0). (19) 

The operator kr is the induced generator of £7(1) after pulling back through the branching 
nodes of the tree. We define (following the above example) 

k (e) = Yl % 

for each edge e to be the product, over all leaves in the subtree T e determined by e, of 
the leaf operators 

k £ = l®l(g>...k<g)l<g>...<g>l 

(k acting on the £'th place in the L-fold tensor product - obviously if £ is a leaf edge, 
k w = k e ). Then 

kr = y^a e k (e ). 

e 

Finally, 5 L ~ 1 p(0) is the maximally branched state (as would derive from a mult if ur eating 
branching). Note however, that this decomposition does not imply that the leaf density 
is equivalent to independent stochastic evolution from this initial branched state - the 
operator kr is not of separable form. 

While (119(1 is independent of basis, it is obviously beneficial to analyse the components 
of each side in terms of the (tensor products of) Hadamard vectors (eigenstates of the 
k operator), as both the separable and non-separable parts of the tree operator kr are 
diagonal in this basis. Briefly the algorithm for determining the weight attributed to a 
term of Pi ea f i n the Hadamard basis can be described as follows (for a formal analysis 
see [El])- Take an arbitrary binary tree, and fix a tree 'split', to be associated with 
the coefficient, in the expansion of Pi ea f, of the basis element consisting of the L-fold 
tensor product of — Hadamard vectors on a chosen subset of distinguished leaves, with 
+ Hadamard vectors at the remaining non-distinguished leaf positions. On the graph of 
the tree assign — signs to the distinguished leaf edges, and + signs to the remainder, 
and propagate signs to the remaining edges multiplicatively (eg adjacent siblings with — 
signs will generate a + sign on their ancestral edge). The corresponding signed sum of 
edge parameters a e is precisely the exponent generated by the action of kr on this basis 
element. After the overall e~^ a <= prefactor is multiplied through, only the negatively 
signed edge terms are present in the exponent (with coefficient -2). Finally the numerical 
factors accompanying the terms proportional to py(0) and Pr(0) can easily be read off 
from IJIKJl . 

It is clear that the above presentation is equivalent to the standard discrete Fourier 
analysis on trees techniques involving the Hadamard transform (21131 El E]- Specifically, the 
surviving edge parameters which provide the argument of the exponential are nothing but 
the nonintersecting path edge sums for a given leaf split, as emerges from the Hadamard 
transform in edge space. The standard Z2 x Z2 colour symmetry is of course inherent in 



the Hadamard matrix, which is also mandatory for the simultaneous diagonalisation of the 
Kimura generators. However, from the viewpoint of Lie symmetries, the latter determine 
3 (infinite) continuous symmetry groups, rather than being identified with the 3 non-unit 
elements of a discrete group. Crucial for our derivation is the coproduct property (fT2j) . 
and the fact that the combinatorics of the tree determines the final action of the symmetry 
group on the L-fold tensor product carrying the leaf probability density. 

In this letter we have provided a framework for the analysis of phylogenetic branching 
models on the basis of continuous transformation symmetries of the rate matrix and the 
branching operator. The formalism can be applied to the Kimura 3ST (and also the 
2P) model, as well as the symmetric binary character model [KJ E] and it reproduces 
the standard spectral transform analysis. Importantly, it can potentially be applied to 
any model where the off-diagonal rates can be associated with an abelian subalgebra of 
SL(K) whose generators have the form of permutation matrices (so that the intertwining 
property holds). We defer a formal presentation of such generalisations, and of the role 
of Lie symmetries and representation theory in branching models to a separate paper [TU]. 
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